Lecture 4: Debugging, and data visualization with tidyverse!

CSSS 508

Author

Jess Kunke

Published

October 22, 2024

Updates and reminders

  • Homework and peer reviews (including for HW2) will be graded for completion
    • This is a one-credit course and I want the focus to be on practice, not points
  • Reminder: this week’s office hour is in Thomson, not Padelford (see website for details)
    • Does the office hour time work for people?

Outline for today?

First half of today:

  • Quick review of topics that came up on Ed Discussion
  • Practice with debugging

Second half of today:

  • Pivoting tables and data
  • Introduce plots with tidyverse!

Quick review

Code chunk options

There are several code chunk options, including echo, include, and eval. You can review them in the Quarto documentation.

  1. Which chunk option do we use…

    • to show/hide the code in the rendered html document?
    • to run/not run the code?
    • to run the code but not show any output or messages from it?
  2. Which chunk option(s) should we use when we load packages at the beginning of the Quarto document?

  3. Which chunk option(s) should we use if we want to show the code we used to make a graph, but we don’t want to actually run it (we might do this in a tutorial, like to show how to install a package)?

Formatting text to look like code

To do this, surround the text with backticks (`): typing `mutate()` in your Quarto document will render as mutate() in a code font.

Debugging

Debugging can be frustrating, tiring, and crazymaking for everyone, no matter how long you’ve been coding.

Take a breath, step back, and take a fresh look. Trust in yourself that you can figure it out.
Figure 1: If at first (or at the 15th try) you don’t succeed… you are not alone! With practice, you’ll catch bugs faster and you’ll code in ways that help you avoid errors in the first place. You’ll still have to troubleshoot, but you won’t get caught up so often in the same errors that used to stump you. Embrace the struggle and practice productively with the tips below. (Illustrations: Allison Horst.)

General principles and tips:

  • Test your code and render your document frequently, so that any new error must come from a small set of recent changes
  • Keep calm and be patient with yourself and with R
  • Break the problem down into testable pieces, ruling out possible explanations one at a time
    • Read and maybe google the error messages
      • Some are more useful to google than others; you get more intuition for this over time
    • Identify spots in the code that are likely causing the issue
      • Helpful to test/render frequently
      • Naming your Quarto code chunks can help to identify the spot that’s causing the issue (we’ll talk about this today)
      • Look for clues from the error messages you get, like object XXX not found or an issue with a particular function
    • Change one of the possible problems, and test if you still get the error
  • Step through your code line by line, thinking about what is defined so far and what the result of each line is
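
To make the "object XXX not found" clue concrete, here is a minimal sketch with made-up data (the name heights_cm is hypothetical, not from the lecture). It wraps the broken call in tryCatch() so that we can read the error message without stopping the script:

```r
# Hypothetical data: three heights in centimeters
heights_cm <- c(150, 162, 171)

# Typo on purpose: we ask for height_cm, but the object is named heights_cm
msg <- tryCatch(
  mean(height_cm),
  error = function(e) conditionMessage(e)
)
print(msg)

# The message names the object R could not find, which points us to the typo.
# Using the correct name makes the call work:
mean(heights_cm)
```

When you see this kind of message, a good first check is whether the object was ever created and whether it is spelled (and capitalized) exactly the same way everywhere.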
Your turn

Here’s an example that brings together a bunch of different concepts we’ve seen. Try this in small groups, then we’ll go over it as a class.

I tried to write code to convert the population to millions and filter for just the rows with a population of at least 10 million. Why isn’t it working? How would you modify the code to get it working?

Note:

  • You’ll probably have to debug this code in multiple steps
  • Just because the final result doesn’t print an error message doesn’t guarantee it’s working; make sure to check whether at the end you have the result you wanted

library(gapminder)
mutate(gapminder, v3 = pop/10^6)
filter(gapminder, pop > 10)

How to read a Quarto rendering error message

When you try to render a Quarto document and it fails, you get an error message something like this:

Let’s break this down.

  • The thing with the arrow ==> at the top is the command that RStudio runs when you click the Render button.
  • Then it says it’s processing your file, and it shows some of the progress it made through rendering
    • The “unnamed-chunk-2” refers to a code chunk in your code
    • It says unnamed because we didn’t name the chunk (we hadn’t talked about how to do that anyway). If you had named it, the name would appear there instead of “unnamed-chunk-2”.
  • The actual error message starts with “Quitting from ….”
    • The part that tells you something about what’s happening is the error message itself: object 'true' not found. This suggests that somewhere you typed a “true” where it doesn’t know what to do with it.

Naming chunks

If I render the following Quarto document:

I get the following error message:

If I take the same Quarto document and just name my code chunks like so:

Then I get somewhat more helpful output about the progress before the error:

Your turn

Make a Quarto document that looks like the second one above, with the named chunks.

Then identify and fix the error(s) to get it rendering successfully.

Render frequently!

Rule number 1: render frequently! But of course sometimes you forget, or you have other reasons to add a lot of code or make a lot of changes before rendering. Let’s talk about how to troubleshoot in that case: sequentially cut out code and try rendering again.

  1. Cut (as in cut and paste, not as in delete) everything after the first code chunk.
  2. Try to render the document again.
    1. If it doesn’t render successfully, you know that at least the first issue is in the small document you just tried to render: the YAML header, some Markdown text/formatting, or the first code chunk.
    2. If it renders without errors, then the issue must be happening somewhere in the part that you cut. Paste it back, and now repeat step 1 but cut everything after the second code chunk. And so on.

Once you identify what line is triggering the error message, then you can work on figuring out what caused the error. Remember: this is not necessarily the code you want to edit. The part of the code you modify to fix the issue could be in an earlier line of code or an earlier code chunk.
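
Here is a small sketch of that last point, using a hypothetical scores vector (not from the lecture). The error is reported at the line that calls sum(), but the line we would actually edit is the earlier one where scores is defined:

```r
# Earlier line: the scores were entered as text (note the quotes)
scores <- c("90", "85", "77")

# Later line: this is where R reports the error ...
msg <- tryCatch(
  sum(scores),
  error = function(e) conditionMessage(e)
)
print(msg)

# ... but the fix belongs on the earlier line, where scores is defined
scores <- c(90, 85, 77)
sum(scores)
```

So when you find the line that triggers the error, also look back at where each object on that line was created.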

Your turn

In fact let’s practice that right now! In the qmd file below,

  1. Which line is generating the error message? (To which line is the error message referring?)
  2. Where would you edit the code to fix the error?

Pivoting tables and data

Recall where we left off last time with data manipulation in tidyverse: We wanted to summarize how many observations we had for each year and each continent, so we made the following table.

n_obs_by_year_cont <- gapminder %>%
  group_by(year, continent) %>%
  summarize(n.obs = n())

kable(head(n_obs_by_year_cont, n=12))
year  continent  n.obs
1952  Africa        52
1952  Americas      25
1952  Asia          33
1952  Europe        30
1952  Oceania        2
1957  Africa        52
1957  Americas      25
1957  Asia          33
1957  Europe        30
1957  Oceania        2
1962  Africa        52
1962  Americas      25

This table has a separate row for each year-continent combination. What we would probably find more readable is to have each row represent a year (or continent) and each column represent a continent (or year), and then the values currently in the n.obs column would become the entries in each cell of the table:

year  Africa  Americas  Asia  Europe  Oceania
1952      52        25    33      30        2
1957      52        25    33      30        2
1962      52        25    33      30        2
1967      52        25    33      30        2
1972      52        25    33      30        2
1977      52        25    33      30        2
1982      52        25    33      30        2
1987      52        25    33      30        2
1992      52        25    33      30        2
1997      52        25    33      30        2
2002      52        25    33      30        2
2007      52        25    33      30        2

We say that n_obs_by_year_cont is currently in long format, and we would like to pivot to a wider format. To do that, we’ll use the tidyverse function pivot_wider():

# we start with n_obs_by_year_cont
table_year_cont = n_obs_by_year_cont %>%
  pivot_wider(
    # get the names for the new columns from the continent column
    names_from = continent,
    # get the values for the new columns from the n.obs column
    values_from = n.obs
  )

Sometimes we have a wide-format table that we want to pivot longer. This will come up later as we’re plotting.
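
As a preview, here is a minimal sketch of pivoting longer with the tidyr function pivot_longer(). The small wide_counts table is typed in by hand for illustration (it is a hypothetical stand-in, not an object from the lecture), and the sketch assumes the tidyr package is installed:

```r
library(tidyr)

# Hypothetical wide table: one row per year, one column per continent
wide_counts <- data.frame(
  year   = c(1952, 1957),
  Africa = c(52, 52),
  Asia   = c(33, 33)
)

# Stack the continent columns into name/value pairs
long_counts <- pivot_longer(
  wide_counts,
  cols = c(Africa, Asia),    # the columns to stack
  names_to = "continent",    # the old column names go into this column
  values_to = "n.obs"        # the old cell values go into this column
)

long_counts
```

The result has one row per year-continent combination, which is the same shape as the long table we started from.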

Your turn

Consider that last code chunk, copied and pasted here for clarity:

# we start with n_obs_by_year_cont
table_year_cont = n_obs_by_year_cont %>%
  pivot_wider(
    # get the names for the new columns from the continent column
    names_from = continent,
    # get the values for the new columns from the n.obs column
    values_from = n.obs
  )

  1. How would we change this code so that rows are continents and columns are years?

  2. What other code would we have to run before this code in order for it to work? In other words, if we open a fresh R Session, what is the minimum set of code we would need to run first so that we could create table_year_cont without errors? Or equivalently, if we put the above code in a Quarto document, what is the minimum set of code we would need to put in the Quarto document before this code chunk so that the Quarto document would render without errors?

Data viz

Now for some plotting.

Not that kind of plotting.

Getting started with ggplot()

By popular demand, let’s make a plot of population over time for the country of Japan.

First, review: how do you get the subset of the gapminder dataset that consists of just the Japan data?

# finish this code...
japan_data =

Cool, now let’s plot population versus time with ggplot():

ggplot(japan_data, aes(x = year, y = pop))

Weird, what do you notice? The plot has axes but no data. That’s because the ggplot() call on its own only sets up the plot area (the data and the mapping of variables to the x- and y-axes); it doesn’t draw anything. To draw the data, we add a + at the end of the ggplot() line and add more lines of code. For example, “geoms” (geometric layers) add the actual lines, points, bars, etc.:

ggplot(japan_data, aes(x = year, y = pop)) +
  geom_line()
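
One way to see the layering idea is to store the plot in an object and count its layers. This is only a sketch: the toy data frame is made up, and peeking at the layers element relies on how ggplot2 stores plots internally, which could change between versions.

```r
library(ggplot2)

# Hypothetical toy data, just for illustration
toy <- data.frame(year = c(2000, 2005, 2010), value = c(1, 3, 2))

# ggplot() alone sets up the data and axes but has no layers yet
p_empty <- ggplot(toy, aes(x = year, y = value))
length(p_empty$layers)

# Adding a geom with + appends a layer that actually draws something
p_line <- p_empty + geom_line()
length(p_line$layers)
```

Storing a plot in an object like this is also handy in practice: you can build the base plot once and then try different geoms on top of it.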

We can combine multiple layers too as long as they make sense for the data structure:

ggplot(japan_data, aes(x = year, y = pop)) +
  geom_line() +
  geom_point()

We can make things a lot prettier and more customized too. Here are just a few examples of things we can do:

ggplot(japan_data, aes(x = year, y = pop/1e6)) +
  geom_line(color = "maroon", alpha = 0.7) +
  geom_point(color = "maroon", alpha = 0.7) +
  xlab("Year") + ylab("Population (millions of people)") +
  ggtitle("Japan's population over time") +
  theme_bw()
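
As a side note, ggplot2 also has a labs() function that sets the axis labels and the title in a single call, which some people find easier to read than separate xlab(), ylab(), and ggtitle() calls. A minimal sketch with made-up toy data (the data frame and the label text are placeholders):

```r
library(ggplot2)

# Hypothetical toy data, just for illustration
toy <- data.frame(year = c(2000, 2005, 2010), value = c(1, 3, 2))

p_labeled <- ggplot(toy, aes(x = year, y = value)) +
  geom_line(color = "maroon", alpha = 0.7) +
  # labs() sets several labels at once
  labs(x = "Year", y = "Value (made-up units)", title = "A toy example") +
  theme_bw()

p_labeled
```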

Plotting multiple countries

What if we want to compare several countries on the same plot? We might try the following code, but it looks bonkers. Try it and see:

multi_country_data = filter(gapminder, country %in% c("Japan", "Nigeria", "Argentina", "New Zealand"))

ggplot(multi_country_data, aes(x = year, y = pop)) +
  geom_line()

Why does the plot look like that, and how can we fix it? As part of troubleshooting, we might check what a scatterplot looks like (without connecting the points):

ggplot(multi_country_data, aes(x = year, y = pop)) +
  geom_point()

Notice that looks reasonable. So what’s the issue with the line plot?

The issue is that the multi_country_data dataset has both multiple countries and multiple years, and currently ggplot does not know to group the time series data by country when it is deciding how to connect the points with lines. It is treating it as data to plot as a single line, when we would like a separate line for each country.
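
A quick way to check this diagnosis is to count how many rows share each year. This sketch assumes the dplyr and gapminder packages are installed; rows_per_year is just a name chosen for illustration:

```r
library(dplyr)
library(gapminder)

multi_country_data <- filter(
  gapminder,
  country %in% c("Japan", "Nigeria", "Argentina", "New Zealand")
)

# How many rows share each year? More than one row per year means a single
# line has to pass through several y-values at the same x position.
rows_per_year <- count(multi_country_data, year)
rows_per_year
```

Each year appears once per country, so a single ungrouped line has to jump between countries at every year.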

To fix this, we will specify a grouping variable using the group argument to the aes() function:

multi_country_data = filter(gapminder, country %in% c("Japan", "Nigeria", "Argentina", "New Zealand"))

ggplot(multi_country_data, aes(x = year, y = pop, group = country)) +
  geom_line()

We probably also want to distinguish and label the lines somehow by what country they represent:

multi_country_data = filter(gapminder, country %in% c("Japan", "Nigeria", "Argentina", "New Zealand"))

ggplot(multi_country_data, aes(x = year, y = pop, group = country, color = country)) +
  geom_line()

Facets (subplots)

What if we want to plot each country in a different subplot so that we can see each curve on its own scale? We can use facet_wrap():

ggplot(multi_country_data, aes(x = year, y = pop, group = country)) +
  geom_line() +
  facet_wrap(~ country, scales = "free")

Note that since they all share an x-axis (years), it might make sense to plot them vertically stacked so that the years line up. For this, we can use facet_grid() which allows us to arrange the plots specifically in a row or a column:

ggplot(multi_country_data, aes(x = year, y = pop, group = country)) +
  geom_line() +
  # make countries the rows
  facet_grid(country ~ ., scales = "free")
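
Another option for stacking the panels, shown here only as a sketch, is to keep facet_wrap() and ask for a single column with its ncol argument. The name p_stacked is just for illustration, and the sketch assumes dplyr, gapminder, and ggplot2 are installed:

```r
library(dplyr)
library(gapminder)
library(ggplot2)

multi_country_data <- filter(
  gapminder,
  country %in% c("Japan", "Nigeria", "Argentina", "New Zealand")
)

p_stacked <- ggplot(multi_country_data, aes(x = year, y = pop, group = country)) +
  geom_line() +
  # one column of panels; only the y-axis is allowed to differ between panels
  facet_wrap(~ country, ncol = 1, scales = "free_y")

p_stacked
```

The main visible difference is where the country labels go: facet_wrap() puts them above each panel, while facet_grid() with countries as rows puts them on the right-hand side.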

Pivoting for plots

What if we want to plot one country but three different variables: lifeExp, pop, and gdpPercap? We’d basically like to group by variable, but to do that, the variable name has to be a column of the dataset. That is, we need a column whose values are “lifeExp”, “pop”, and “gdpPercap” (or some other names for these three quantities).

To do this… yes, we will pivot longer!

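# the result is in long format, so we call it japan_long
japan_long = japan_data %>%
  pivot_longer(
    # stack the columns from lifeExp through gdpPercap (lifeExp, pop, gdpPercap)
    cols = lifeExp:gdpPercap,
    # the old column names go into a new column called variable
    names_to = "variable",
    # the old cell values go into a new column called value
    values_to = "value"
  )

ggplot(japan_long, aes(x = year, y = value, group = variable)) +
  geom_line() +
  facet_wrap(~ variable, scales = "free")

Again, we can use facet_grid() to stack the plots so they align by year:

ggplot(japan_long, aes(x = year, y = value, group = variable)) +
  geom_line() +
  facet_grid(variable ~ ., scales = "free")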

To see what else you can do with ggplot, check out further documentation online such as the ggplot gallery and the ggplot vignettes or “articles”. Happy exploring!